Georeferencing Annual Household Surveys in Rural Madagascar: A Comprehensive Analysis of Observatories (1995-2014)

Author

Florent Bédécarrats

Abstract
This technical appendix accompanies a forthcoming data paper. It focuses on the georeferencing process of household panel data collected by the Rural Observatory Network (ROR) from 1995 to 2014. The ROR captured information about the diverse rural contexts in Madagascar. While the quality data gathered was carefully curated, its geolocation attributes were not fully curated, restricting its utility in spatially integrated research or public policy evaluation. This appendix systematically aligns the varying spatial denominations found in the ROR dataset with a standardized geographic database, to provide a consistent correlation between each ROR observation and corresponding geospatial entities, namely communes and fokontany. We do so leveraging computational notebooks, specifically the Quarto format, utilizing the R programming language. This interactive setup fosters a blend of code, outcomes, explanations, and multimedia resources, ensuring that the ROR data becomes more dependable and user-friendly for spatial analysts and other researchers.

Note: To pilot-test our georeferencing approach, we tested it with 8 observatories (out of the 26 of the rural observatory network). We prioritized the observatories that were in the vicinity of protected areas, as it exemplifies the usefulness of our approach. This procedure will soon be extended to the 18 other observatories.

Introduction

This document serves as a technical appendix accompanying the paper tied to opening of the data collected by the Rural observatory network (ROR, for its acronym in French) from 1995 to 2014. The ROR was a panel survey system that monitored the living conditions of populations in selected localities representing the diversity of rural contexts in Madagascar. While the ROR data underwent a rigorous supervision and curation process during its collection phase, complemented by an exhaustive harmonization effort subsequently, its geolocation attributes remain largely incomplete. This absence of usable geographic reference in the data hinders its integration with other spatially-referenced datasets, particularly when one contemplates its utility in realms such as public policy formulation or project evaluation.

This study endeavors to bridge this gap through a meticulous georeferencing process. The primary objective is to reconcile the somewhat erratic spatial denominations, or toponyms, found within the ROR dataset with a more standardized spatial database. Ultimately, this effort aims to establish a congruent correspondence between each observation from the ROR survey and both a polygon representing a Malagasy municipality and a referential village point.

To achieve this in a reliable way that enables external verification and subsequent improvements, we harness the capabilities of computational notebooks (Quarto format). As interactive computing environments, these notebooks facilitate an amalgamation of code, computational output, explanatory text, and multimedia resources. We use R for the code, as it offeres a high versatility to handle spatial data, complex surveys and textual informaiton. By adopting this methodology, we aim to bolster the reliability and usability of the ROR data for spatial analysts and researchers.

Code
library(tidyverse)    # A series of packages for data manipulation
library(haven)        # Required for reading STATA files (.dta)
library(labelled)     # To work with labelled data from STATA
library(sf)           # for spatial data handling
library(tidyverse)    # for data wrangling and visualization
library(stringdist)   # for string distance and matching
library(tmap)         # for mapping
library(fuzzyjoin)    # for fuzzy joining
library(readxl)       # Read data frames to Excel format
library(writexl)      # Write data frames to Excel format
library(gt)           # for nicely formatted tables
library(cowplot)      # to combine plots
library(gtsummary)    # to produce nice summary tables
library(janitor)      # to simply add rowsums

Revisiting the Rural Observatories (ROR) of Madagascar

The Rural Observatory Network (ROR) was established as part of the Madio project (1994-1999), a technical assistance program of the IRD to the national statistical institute (INSTAT) of Madagascar, as an innovative mechanism for monitoring and analyzing the country’s agricultural sector. This initiative arose from the urgent need to compile statistics on the agricultural industry due to the breakdown of the existing statistical system following the socialist era and political turmoil in the country.

Four initial observatories were established in 1995, each representing a different set of agricultural challenges in the country’s diverse ecosystems. From 1995 to 2015, 22 other observatories were created, and existing were abandoned (see Figure 1). The ROR were supported by a variety of donors and NGOs during this period: European union, AFD, World Bank, GTZ, Swiss cooperation, Care and Rio Tinto. The selection of observation sites was based on a combination of criteria, including agroclimatic areas, dominant production systems, population density, accessibility, and the presence of support structures and development projects. These observatories aimed to provide insights into the dynamics of Malagasy agriculture, covering a range of issues such as isolated coastal communities of fishermen and livestock farmers, the impact of trade liberalization on vanilla producers, family smallholdings in the central plateau, and the restructuring of large irrigated rice-growing areas.

Code
# Define a function to load and count surveys per observatory for a given year
load_and_count <- function(year, factorize = FALSE) {
  # Define file path
  file_path <- paste0("Données ROR/enter/", year, "/res_deb.dta")
  
  # Load data
  data <- read_dta(file_path)
  
  # Extract label and convert to factors if option
  if (factorize) {
    data <- data %>%
      mutate(across(everything(), as.character),
             across(where(is.labelled), ~ as.character(as_factor(.))))
  }
  
  # Count surveys per observatory
  count_data <- data %>%
    group_by(j0) %>%
    summarise(survey_count = n()) %>%
    ungroup() %>%
    mutate(year = year)  # Add year column
  
  return(count_data)
}

# Generate a list of years
years <- 1995:2014

# Use purrr::map_df to loop through each year and bind results
obs_count <- map_df(years, load_and_count) %>%
  # Remove rows with observatory "7 " and "NA", whch are errors
  filter(j0 != 7 & !is.na(j0)) %>%
  rename(observatory = j0)

# Read observatory names
observatory_names <- readxl::read_xlsx("Observatory_names.xlsx") %>%
  select(code, observatory_name = name)

# PAss it to wide.
obs_count <- obs_count %>%
  left_join(observatory_names, by = c("observatory" = "code")) %>%
  group_by(observatory_name, year) %>%
  summarise(survey_count = sum(survey_count))

obs_count_wide <- obs_count %>%
  pivot_wider(names_from = year, values_from = survey_count)

# Add observatory approximate location
locations <- tibble(
  code = c(1, 2, 3, 4, 12, 13, 15, 16, 21, 22, 23, 24, 31, 25, 41, 42, 43, 51, 
           44, 45, 61, 17, 18, 19, 71, 52),
  name = c("Antalaha", "Antsirabe", "Marovoay", "Toliara", "Antsohihy", 
           "Tsiroanomandidy-Bongo", "Farafangana", "Ambovombe", 
           "Alaotra", "Manjakandriana", "Toliara littoral", 
           "Fenerive Est", "Bekily", "Mahanoro", "Soaviandriana-Itasy", 
           "Menabe-Belo", "Fianarantsoa", "Tsivory", "Morondava", "Manandriana", 
           "Tanandava", "Ihosy", "Ambohimahasoa", "Manakara", "Tolanaro", 
           "Menabe-Nord-Est"),
  latitude = c(-14.8833, -19.8659, -16.1000, -23.3558, -14.8796, -18.7713, 
               -22.8167, -25.1667, -17.8319, -18.9167, -23.3558, -17.3500, 
               -24.6900, -19.9000, -19.1686, -19.6975, -21.4527, -24.4667, 
               -20.2833, -20.2333, -22.5711, -22.4000, -20.7145, -22.1333, 
               -25.0381, -20.5486),
  longitude = c(50.2833, 47.0333, 46.6333, 43.6683, 47.9875, 46.0546, 47.8333, 
                46.0833, 48.4167, 47.8000, 43.6683, 49.4167, 45.1700, 48.8000,
                46.7354, 44.5419, 47.0857, 45.4667, 44.2833, 47.3833, 45.0439, 
                46.1167, 47.0389, 48.0167, 46.9562, 47.1597))

obs_count <- left_join(obs_count, locations, by = c("observatory_name" = "name"))


madagascar <- st_read("../Entités administratives/OCHA_BNGRC frontières administratives/mdg_admbnda_adm0_BNGRC_OCHA_20181031.shp",
                      quiet = TRUE)

# Sort locations by latitude to generate sequence numbers
locations <- locations %>%
  arrange(desc(latitude)) %>%
  mutate(seq_num = 1:n())

# Create map plot with labels
map_plot <- ggplot(data = madagascar) +
  geom_sf(fill = "lightgray") +
  geom_point(data = locations, aes(x = longitude, y = latitude, color = name), 
             size = 3) +
  geom_text(data = locations, aes(x = longitude, y = latitude, label = seq_num), 
            vjust = -1, hjust = 1, size = 3) + 
  theme_void() +
  theme(legend.position = "none")

# Add sequence numbers to observatory names in obs_count dataframe
obs_count <- obs_count %>%
  left_join(locations %>%
              select(name, seq_num), 
            by = c("observatory_name" = "name")) %>%
  mutate(observatory_with_num = paste0(seq_num, ". ", observatory_name))

# Create timeline plot using modified obs_count with observatory_with_num
timeline_plot <- ggplot(obs_count, 
                        aes(x = year, 
                            y = fct_reorder(observatory_with_num, latitude), 
                            color = observatory_name)) +
  geom_point(aes(size = survey_count), show.legend = F) +
  theme_minimal() +
  labs(y = NULL, x = NULL) +
  theme(axis.text.y = element_text(size = 8),
        legend.position = "none")

# Stitch the plots together
combined_plot <- plot_grid(map_plot, timeline_plot, rel_widths = c(1.3, 2))

ggsave("../../Protocole/Figures/ROR_history.png", plot = combined_plot, 
       width = 10, height = 7, dpi = 300)

print(combined_plot)

Figure 1: Coarse location of rural observatory and survey years

This set-up was particularly significant for rural development and macroeconomic policy decisions. In the mid-1990s, This initiative was motivated by an urgent need to compile statistics on the agricultural industry due to the breakdown of the existing statistical system. This became crucial as Madagascar experienced significant changes in its rural areas since 1980, particularly with the liberalization of pricing and marketing structures. The Madagascar ROR arose as a part of the broader trend of socio-economic observatories that emerged in sub-Saharan Africa in the 1980s. These observatories are information systems designed to assess changes in society and their impact on specific populations, emphasizing poverty reduction, combating exclusion, and promoting sustainable human development.

Challenges for opening the data

In this notebook, our primary objective is to enhance the georeferencing of the ROR survey data for open data sharing. The initial ROR survey, initiated in 1995, recorded geographical information in varying formats: from “village” to a combination of “municipality”, “village”, and “site”. A significant challenge arose from the fact that while data collection started in 1995, municipalities were only formally established in 1994, with several years required for stabilization. The inherent fluidity in toponyms, predominantly derived from oral traditions, resulted in varied written representations. Our endeavor is to identify, disambiguate, and georeference observations recorded in the ROR data, adopting the Common Operational Datasets (CODs) as a reference, which has been collaboratively defined by OCHA and Madagascar’s BNGRC (National Disaster Management Office).

CODs stand as the bedrock for all preparedness and response operations, especially within the humanitarian sector. Adopted by the IASC in 2008 and revised in 2010, these datasets are pivotal for facilitating informed decision-making during the critical initial hours of a crisis. By ensuring consistency among stakeholders, they simplify data management and establish a shared operational picture of a crisis. Particularly relevant for our purpose is the incorporation of P-codes in CODs. These unique geographic identification codes, found in both Administrative Boundary CODs (COD-ABs) and Population Statistics CODs (COD-PSs), surmount challenges posed by variations in placenames and spellings. For instance, in Madagascar, 81 different administrative level 4 (ADM4) features are labeled “Morafeno”, with six of these existing within ADM3 features also termed “Morafeno”, distinguishable solely by their unique ADM2 features.

P-codes act as reliable geographic identifiers, eliminating errors arising from identical or variably spelled geographic locations. Leveraging the HDX platform, an open platform for cross-crisis data sharing, we fetch this data to ensure the accurate georeferencing of our ROR data. By harnessing the standardized and official spelling of places provided by P-codes, we can amalgamate, harmonize, and analyze data from diverse sources, offering a comprehensive, geo-accurate view of the survey’s findings.

Data Description and Initial Exploration with R

ROR survey data

The ROR survey data is organized in a collection of year-specfic folders ranging from 1995 to 2015. Each yearly folder houses multiple .dta files (Stata data format) – about 85 per year – with diverse filenames such as “res_as.dta” and “res_bp.dta”. Although there’s an element of harmonization achieved, especially regarding certain variables and household identifiers, the data varies in terms of geographical granularity. Initial years primarily provide a singular field denoting the village name. As the years progress, this evolves to include a municipality name, and in the latter years, an additional “site” name occasionally appears. A comprehensive overview of the observations can be gleaned from the “res_deb.dta” files within each year.

Code
# Function to extract variable info for a given year and file
extract_variable_info <- function(year, file) {
  
  file_path <- paste0("Données ROR/enter/", year, "/", file)
  
  if (!file.exists(file_path)) return(tibble())
  
  data <- read_dta(file_path, n_max = 0)
  
  tibble(
    file_name = file,
    variable_name = names(data),
    variable_label = var_label(data) %>% as.character(),
    year = year)
}

# Obtain all years from the directory structure
years <- list.dirs("Données ROR/enter/", recursive = FALSE, full.names = FALSE)

# Use the tidyverse approach to map over years and files
all_vars <- map_df(years, ~{
  files_for_year <- list.files(paste0("Données ROR/enter/", .x), pattern = "\\.dta$", full.names = FALSE)
  map_df(files_for_year, extract_variable_info, year = .x)
})

# Convert any NULL values in variable_label to "NA"
all_vars$variable_label[is.na(all_vars$variable_label)] <- "NA"

# Consolidate the information using the tidyverse approach
variable_dictionary <- all_vars %>%
  group_by(file_name, variable_name) %>%
  arrange(year) %>%  
  summarise(
    variable_label = first(variable_label[variable_label != "NA"] %||% "NA"),
    years_present = list(unique(year))) %>%
  ungroup() %>%
  mutate(years_present = map_chr(years_present, ~ paste(.x, collapse = ",")))

# Write the variable dictionary to an Excel file
write_xlsx(variable_dictionary, "ROR_Variable_Dictionary.xlsx")

The code block above creates a data dictionnary, which you can download by clicking on this link.

Administrative boundaries

The “Madagascar Subnational Administrative Boundaries” dataset is sourced from the Common Operational Datasets (CODs), which offer authoritative reference datasets for decision-making during humanitarian operations. Specifically designed to streamline the discovery and exchange of pivotal data, CODs ensure uniformity and use the ‘best available’ datasets. This particular dataset focuses on administrative boundaries, including gazetteers with P-codes, facilitating organized humanitarian assessments and data management. P-codes act as unique identifiers for every administrative unit and populated area, ensuring standardization in nomenclature. When datasets adhere to the P-code standard, their integration and analysis become efficient. The dataset provides comprehensive boundary information for Madagascar at five administrative levels: country, region, district, commune, and fokontany. It’s accessible for download as shapefiles from the provided link.

Localities

The “Madagascar Populated Places” dataset is also part of the Common Operational Datasets (CODs). This dataset encompasses populated place points for Madagascar. The data has been sourced from the National Geospatial-Intelligence Agency and provided by the University of Georgia - ITOS. Further, the Geographic Information Support Team (GIST) has taken up the role of distributor, with the data being published on 2007-03-07. UN OCHA ROSA has enhanced the dataset by adding P-codes and administrative boundary names, which are based on the BNGRC (National Disaster Management Office) data. The dataset geolocates 28184 populated places with their toponyms (names), codes related to various administrative levels such as fokontany, commune, district, and region, and their spatial coordinates.

Code
# Load datasets
ROR_surveys_2007 <- read_dta("Données ROR/enter/2007/res_deb.dta")
observatory_names <- read_xlsx("Observatory_names.xlsx")
obs_communes <- st_read(
  "data/observatoires/Observatoires_ROR_communes_COD.gpkg",
  quiet = TRUE) %>%
  left_join(select(observatory_names, name, code), 
            by = c("OBS_NAME" = "name")) %>%
  rename(OBS_CODE = code)
pop_places <- st_read(
  "../entités administratives/OCHA_BNGRC populated places/mdg_pplp_places_NGA_OCHA.shp", quiet = TRUE)

Methodology

Simplify strings

The treatment of toponyms presents a unique challenge, especially when these names are captured from varied sources. In the dataset, these names can vary due to differences in languages, case sensitivity, and the inclusion of additional descriptive terms. To address this, the clean_string function was developed. This function begins by converting all strings to lowercase, ensuring that subsequent comparisons are not sensitive to case variations. Next, to create a uniform standard, all non-alphanumeric characters are removed, retaining only spaces and the alphanumeric content. Common qualifiers in toponyms, such as “centre”, “haut” (high) or “bas” (low), which are not used consistently across records, are also removed. Given the bilingual nature of the dataset, with entries potentially in both Malagasy and French, the function translates cardinal points to the Malagasy language to ensure uniformity. Lastly, certain locales with multiple names, such as “Fort Dauphin”, also known as “Taolagnaro”, “Tolagnaro” or “Tolanaro”, are standardized to a single term, “Tolanaro”, to eliminate potential disparities. We also address instances of Roman numerals from I to VI, converting them to their Arabic numeral counterparts, ensuring consistent representation across records.

Code
clean_string <- function(x){
  x %>%
    tolower() %>% # Convert to lowercase
    # Retain spaces, remove other non-alphanumeric characters
    str_replace_all("[^[:alnum:][:space:]]", " ") %>% 
    str_remove_all("\\b(centre|haut|bas|androy)\\b") %>%
    str_trim() %>% # Trim spaces from start and end of string
    str_replace_all("\\bcentre\\b", "") %>% # Remove the word 'centre'
    # Translate cardinal points
    str_replace_all("\\bnord\\b", "avaratra") %>% 
    str_replace_all("\\best\\b", "atsinanana") %>%
    str_replace_all("\\bouest\\b", "andrefana") %>% 
    str_replace_all("\\bsud\\b", "atsimo") %>% 
    str_replace_all("\\batsinana\\b", "atsinanana") %>% # Replace short form 
    str_replace_all("(fort dauphin)|(taolagnaro)|(tolagnaro)", 
                    "tolanaro") %>% # Variations for fort dauphin
    # Convert Roman numerals to Arabic
    str_replace_all("\\bi\\b", "1") %>% 
    str_replace_all("\\bii\\b", "2") %>%
    str_replace_all("\\biii\\b", "3") %>% 
    str_replace_all("\\biv\\b", "4") %>% 
    str_replace_all("\\bv\\b", "5") %>% 
    str_replace_all("\\bvi\\b", "6")
}

Fuzzy matching

By default, statistical softwares and computing language match text by pairing only identical strings. Exact string matching is inappropriate in our context, where location data entry was subject to human errors like typographical mistakes or minor variations in spelling. To avoid this rigidity, fuzzy matching is employed. This approach gauges the degree of similarity between two strings, bypassing the need for an exact character-to-character match. The principle metric adopted for this is the “Levenshtein distance”, which quantifies the minimum number of single-character edits required to change one string into another. The fuzzy_match function encapsulates this approach. The function initiates the process by filtering the reference list of encontered toponyms based on a given observatory code, which considerably narrows down potential matches. Then, using the Jaro-Winkler distance metric — a variant of the Levenshtein distance particularly suited for shorter strings — the function computes the similarity between the target string and entries in the filtered reference. To ensure that only relevant matches are acknowledged, a threshold, termed max_distance, is set. Matches that exceed this threshold are disregarded. For those that pass this validation, the function then extracts the pertinent details of the matched row from the reference dataframe.

Code
fuzzy_match <- function(target_string, dataframe, column_name, observatory_code, 
                        max_distance = 0.25) {
  # Filter the dataframe based on observatory_code
  filtered_reference <- dataframe %>%
    filter(OBS_CODE == observatory_code) %>%
    select(all_of(column_name), ADM3_PCODE, ADM3_EN)
  
  # If filtered_reference is empty, return NA values
  if (nrow(filtered_reference) == 0) {
    return(list(matched_string = NA, ADM3_PCODE = NA, ADM3_EN = NA, distance = 1))
  }
  
  # Use stringdist to find the closest match
  distances <- stringdist::stringdistmatrix(target_string, filtered_reference[[column_name]], method = "jw")
  
  # If there are no valid distances, set min_distance to Inf (this should help avoid the error)
  if(all(is.na(distances))) {
    min_distance <- Inf
  } else {
    min_distance <- min(distances, na.rm = TRUE)  # Ensure NA values don't affect the min calculation
  }
  
  # Check for Inf distance and replace it with 1
  if (is.infinite(min_distance)) {
    min_distance <- 1
  }
  
  # If min_distance exceeds the max_distance threshold, return NA values
  if (min_distance > max_distance) {
    return(list(matched_string = NA, ADM3_PCODE = NA, ADM3_EN = NA, distance = NA))
  }
  
  matched_row <- filtered_reference[which.min(distances), ]
  
  return(list(matched_string = matched_row[[column_name]], 
              ADM3_PCODE = matched_row$ADM3_PCODE, 
              ADM3_EN = matched_row$ADM3_EN, 
              distance = min_distance))
}

Hierarchical matching of data

The hierarchical organization of spatial entities is key for our challenge. Such an organization allows for a cascading representation of data, from broader scopes narrowing down to more specific layers. This representation is reminiscent of the defining order of geographical entities: regions contain provinces, which contain municipalities, and these in turn contain localities. In the georeferencing context, leveraging this hierarchical structure can lead to more precise matches. For instance, if an observatory code is associated with a specific district, the search for matches is confined to that district, enhancing both the efficiency and accuracy of the process. Figure 2 represents this hierarchical arrangement, serving as a roadmap for the subsequent data matching tasks.

graph TD

A[Legend]

W[ ]
X[ ]
Y[ ]
Z[ ]

W -->|Pre-definede hierarchy| X
Y -.-|Implemented matching|Z

graph TD

%% Administrative Logic
subgraph "Administrative Logic"
A[Region]
B[District]
C[Commune]
D[Fokontany]
I[Populated places]

A --> B
B --> C
C --> D
D --> I
end

%% Observatory Logic

subgraph "Observatory Logic"
E[Observatory network]
F[Observatory]
G[Commune]
H[Village]
J[Site]

E --> F
F --> G
G --> H
F --> H
H--> J
end

%% Geospatial matching
H -.- I
C -.- G

Figure 2: Spatial entities pre-defined relationship and matching

Madagascar’s current administrative setup is straightforward: regions contain districts; districts have communes; and communes are made up of Fokontany. Though a Fokontany should in principle be a single village, it often includes multiple villages or populated areas. It’s worth noting that while the idea of communes has been around for a while, they were only officially recognized in 1994. However, rolling them out took some time after 1994. Before 1994, “communes” simply described local government areas without any formal administrative status. The regions were created in 2004. On the ROR side, observatory refereed during the first surveys to villages. A systematic registry of the communes only started in the 2004 and 2005 round, depending on the observatories. A mention to “sites” also appeared in 2011 but was scarcely documented. Our strategy was to established links between the village information and populated places, and between communes.

A Detailed walk-through

We now break down our approach to describe each subsequent step. We began by exploring the ROR documentation, and in particular the reports associated to community survey, that contain descriptions of the area surveyed by each observatory. While doing so, we updated the COD subnational administrative boundary dataset by adding a field named OBS_Y_N. This field was marked as ‘1’ if the municipality was listed in an observatory survey; if not, it was left empty. Additionally, we added another field, OBS_NUM, which would store the observatory number. If the municipality wasn’t part of any survey, this field was left empty. The list of observatories and surveyed municipalities can be found in the Table 2.

Next, we moved on to geolocation, which took place in four stages. Each stage depended heavily on the data quality and completeness obtained from the previous ones.

Method 1

This phase entailed a systematic alignment process for the municipalities. When municipality names were included in our dataset, an attempt was made to correlate them with names of municipalities identified as data collection locales. For enhanced precision, this alignment was conducted specific to each observatory. Given the potential for variance in nomenclature across sources, a fuzzy matching algorithm was employed. This method calculates the probability of two distinct names referencing the same entity. Upon identifying a probable match, we performed a visual verification for all matches and flagged the false positives to remove them.

The following code segment elucidates the methodology applied:

Code
# List of years
years <- 1995:2014

# Read all datasets and combine
all_surveys_description <- map_df(years, function(year) {
  df <- read_dta(paste0("Données ROR/enter/", year, "/res_deb.dta"))
    # Convert all columns to character to ensure consistency
  df <- df %>% mutate_all(as.character)
  return(df)
})

# Extract unique combinations and list all the years they appeared in
unique_combinations <- all_surveys_description %>%
  group_by(j0, j42, j4) %>%
  summarize(years = toString(unique(year)),
            obs_count = n()) %>%
  ungroup()

# Harmonize the fields that contain municipality or village names
unique_combinations <- unique_combinations %>%
  mutate(clean_muni = clean_string(j42),
         clean_village = clean_string(j4))
obs_communes <- obs_communes %>%
  mutate(clean_ADM3 = clean_string(ADM3_EN))
pop_places <- pop_places %>%
  mutate(clean_pname = clean_string(PLACE_NAME)) 

# List of observatories for which municipalities have been identified
identified_observatories <- unique(obs_communes$OBS_CODE) %>%
  na.omit()
# Filter for the observatory for which we already have a manual identification
# of municipalities
unique_combinations <- unique_combinations %>%
  filter(j0 %in% identified_observatories) 

# Apply the fuzzy matching observatory-wise
results <- map2_df(unique_combinations$clean_muni, 
                   unique_combinations$j0, 
                   ~ as.data.frame(t(
                     fuzzy_match(.x, obs_communes, "clean_ADM3", .y))))  %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Combine the results with the unique_combinations
correspondence_table <- bind_cols(unique_combinations, results) 

# Add a column for the matching method
correspondence_table <- correspondence_table %>%
  mutate(method = ifelse(!is.na(matched_string), "method_1", NA_character_))

To synthesize and visually represent our matches, the identified municipalities are plotted on a map:

Code
# Extract matched resuts
matched_results <- unique(correspondence_table$ADM3_PCODE) %>%
  na.omit()
# Keep municipalities in those
matched_spatial <- obs_communes %>%
  filter(ADM3_PCODE %in% matched_results)

tmap_mode("view")
# Plot
tm_shape(matched_spatial) +
  tm_polygons(col = "OBS_NAME")

Method 2: Extracting municipality names from text in “village names”

For observations where the field “municipality name” was empty (all cases before 2004 and still a frequent situation afterwards), we turned our attention to the village name field. The objective was to decipher whether these village names could potentially contain a municipality’s name that had been previously pinpointed during the observatory surveys. For the observations with no municipality value, we teased out the first word or segment in the village name. This extracted word then underwent a process of fuzzy matching against the roster of identified municipalities’ names. However, manual verification spotted errors. Certain matches, which we labeled ‘false positives’, emerged as errors and were purged from the results. Once this cleansing step was concluded, we included the validated matches back into the primary correspondence table.

Code
# Filter out matched municipalities and extract the first word
unmatched_results <- correspondence_table %>%
  filter(is.na(matched_string)) %>%
  select(j0:clean_village) %>%
  mutate(first_word = str_extract(clean_village, "^[^\\s/]+"))

# Fuzzy matching with the first word and identified municipalities
results_step2 <- map2_df(unmatched_results$first_word, 
                         unmatched_results$j0, 
                         ~ as.data.frame(t(
                           fuzzy_match(.x, obs_communes, "clean_ADM3", .y))))  %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Update Results
potential_matches2 <- bind_cols(unmatched_results, results_step2)

# Manually identify false positives and remove them
false_positives <- c("madiromionga", "maroarla", "tsaratanteraka", 
                    "ambatoharanana", "ambatoaranana", "maroala", "erakoja", 
                    "erakoka", "maroalo", "erakka", "erakoa")
validated_matches2 <- potential_matches2 %>%
  mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),
               ~ ifelse(first_word %in% false_positives, NA, .)),
         method = ifelse(!is.na(matched_string), "method_2", NA_character_))

# Integrate new results in correspondence table
correspondence_table <- correspondence_table %>%
  filter(!is.na(matched_string)) %>%
  bind_rows(validated_matches2) %>%
  select(-first_word)

Method 3: Village name fuzzy matching with populated places

With some observations still devoid of a matched municipality, we initiated another processing layer. This phase saw the unmatched village names being subjected to fuzzy matching against the populated places dataset described above. We first sifted out the unmatched results from our prior analysis. Then, we augmented the ‘pop_places’ dataset with observatory codes for the entries that were located in communes recognized as surveyed by observatories. Next we performed another fuzzy matching, comparing the village names against the ‘pop_places’ names. To maintain a level of precision, we set a restrictive threshold, dismissing any match that exceeded a certain “distance” or degree of difference. But, as with the previous methods, visual verification revealed some missmatches. These ‘false positives’ were flagged and discarded and the remaining ones were integrated into the overarching correspondence table.

Code
# Re-filter unmatched results
unmatched_results2 <- correspondence_table %>%
  filter(is.na(matched_string)) %>%
  select(j0:clean_village)

# Join OBS_CODE to pop_places
pop_places <- pop_places %>%
  rename(ADM3_PCODE = COM_PCODE, ADM3_EN = COMMUNE) %>%
  mutate(ADM3_PCODE = str_replace(ADM3_PCODE, "^MDG", "MG")) %>%
  left_join(select(obs_communes, ADM3_PCODE, OBS_CODE) %>%
                     st_drop_geometry(), 
            by = "ADM3_PCODE")

# Apply the fuzzy matching observatory-wise
results_step3 <- map2_df(unmatched_results2$clean_village, 
                         unmatched_results2$j0, 
                         ~ as.data.frame(t(
                           fuzzy_match(.x, pop_places, "clean_pname", .y,
                                       max_distance = 0.22))))  %>% # erratic results beyond
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Bind the results with unmatched_results_v2
potential_matches3 <- bind_cols(unmatched_results2, results_step3)

# Manually identify false positives and remove them
false_positives2 <- c("analambarika", "maroaloka", "ambakela", "ambodirofia",
                      "ambohibao", "ambato mangabe", "marofonaritra", 
                      "andranovo ambodimanga", "morataitra", "antanambao", 
                      "ankililoa")
validated_matches3 <- potential_matches3 %>%
  mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),
               ~ ifelse(clean_village %in% false_positives2, NA, .)),
         method = ifelse(!is.na(matched_string), "method_3", NA_character_))

correspondence_table <- correspondence_table %>%
  filter(!is.na(matched_string)) %>%
  bind_rows(validated_matches3)

Method 4

For the remaining, let’s try matching with other village names for which municipality has been matched.

Even after the above measures, there remained observations that had eluded a municipality match. The subsequent strategy was to juxtapose them with other village names that had already been successfully matched to a municipality. We assembled a dataset comprising of municipality names and their paired village names, using data from previous successful matches. UThen the still-unmatched village names were run through a fuzzy matching process against the known village names. We fine-tuned the algorithm with a restrictive matching threshold. Again, some matches stood out as anomalies. Labeled as ‘false positives’, these were sifted out and the newly matched data was incorporated into the main correspondence table.

Code
# Create a list of municipality names and village names for matched observations
matched_villages <- correspondence_table %>%
  filter(!is.na(method)) %>%
  select(j0, ADM3_PCODE, ADM3_EN, clean_village) %>%
  distinct() %>%
  rename(OBS_CODE = j0)

# Re-filter unmatched results
unmatched_results3 <- correspondence_table %>%
  filter(is.na(method)) %>%
  select(j0:clean_village)

# Try matching unmatched villages against matched village names observatory-wise
results_village_match <- map2_df(
  unmatched_results3$clean_village, unmatched_results3$j0,
  ~ as.data.frame(t(fuzzy_match(.x, matched_villages, "clean_village",
                                .y, max_distance = 0.28)))) %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))
# Bind these results with unmatched_results_v3
potential_matches4 <- bind_cols(unmatched_results3, results_village_match)

# Manually identify false positives and remove them
false_positives4 <- c("amp0mbibitika antanakova", "arakoke ambonano ampihamibe", 
                      "farara farara", "farara ambakela", "fara ambakela", 
                      "ambatotelo marofinaritra")
validated_matches4 <- potential_matches4 %>%
  mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),
               ~ ifelse(clean_village %in% false_positives4, NA, .)),
         method = ifelse(!is.na(matched_string), "method_4", NA_character_))


# Merge the updated results back to the correspondence table
correspondence_table <- correspondence_table %>%
  filter(!is.na(matched_string)) %>%
  bind_rows(validated_matches4)

Matching villages

After successfully matching commune names with an official, normalized reference, our next goal is to align the villages as identified by the ROR with their administrative counterpart, known as Fokontany. This task presents challenges, notably because the ROR’s village identifications weren’t directly linked to Fokontany names, and the Fokontany names themselves have undergone changes. Additionally, variations in village naming within the ROR dataset complicate the process. Our approach involves several steps: First, we load the official COD Fokontany dataset. Next, we standardize the place names using our previously discussed clean_string function. Then, using the purpose-built fuzzy_match_village function, we assess the similarity between the village names in the ROR dataset and the Fokontany names within the same commune, specifically by calculating the Jaro-Winkler distance between the strings. We the names with the highest degree of matching. During a manual review of these results, we closely examined the matches, especially those with larger Jaro-Winkler distance. From this examination, we derived a distance of 0.2, below which the majority of matches were accurate. Notably, some matches above this threshold were deemed accurate and retained, while a small number of matches below this threshold were identified as false positives and subsequently discarded.

Code
# Load the ADM4 dataset, which corresponds to Fokontany
fokontany <- st_read("../Entités administratives/OCHA_BNGRC frontières administratives/mdg_admbnda_adm4_BNGRC_OCHA_20181031.shp",
                      quiet = TRUE) %>%
  mutate(ADM4_clean = clean_string(ADM4_EN))

# A function to perform fuzzy matching with Fokontany or populated places
fuzzy_match_village <- function(target_string, dataframe, column_name, 
                                ADM3_PCODE_filter, id_column, 
                                municipality_string = NULL, max_distance = 0.4) {
  
  # Check if target_string has multiple words and if the first word matches the municipality_string
  if (!is.null(municipality_string) && length(unlist(strsplit(target_string, " "))) > 1) {
    first_word <- unlist(strsplit(target_string, " "))[1]
    distances_municipality <- stringdist::stringdist(first_word, municipality_string, method = "jw")
    if (min(distances_municipality, na.rm = TRUE) <= max_distance) {
      target_string <- str_remove(target_string, paste0("^", first_word, " "))
    }
  }
  
  filtered_reference <- dataframe %>%
                          filter(ADM3_PCODE == ADM3_PCODE_filter) %>%
                          select(all_of(column_name), all_of(id_column))
  
  # If filtered_reference is empty, return NA values
  if (nrow(filtered_reference) == 0) {
    return(list(matched_string = NA, ID = NA, geometry = NA, distance = 1))
  }
  
  # Use stringdist to find the closest match
  distances <- stringdist::stringdistmatrix(
    target_string, filtered_reference[[column_name]], method = "jw")
  
  # If there are no valid distances, set min_distance to Inf
  if(all(is.na(distances))) {
    min_distance <- Inf
  } else {
    min_distance <- min(distances, na.rm = TRUE)
  }
  
  # Check for Inf distance and replace it with 1
  if (is.infinite(min_distance)) {
    min_distance <- 1
  }
  
  # If min_distance exceeds the max_distance threshold, return NA values
  if (min_distance > max_distance) {
    return(list(matched_string = NA, ID = NA, geometry = NA, 
                distance = NA))
  }
  
  matched_row <- filtered_reference[which.min(distances), ]
  
  return(list(matched_string = matched_row[[column_name]], 
              ID = matched_row[[id_column]], 
              distance = min_distance))
}
# correspondence table without municipality geometry
correspondence_table2 <- correspondence_table %>%
  st_drop_geometry()

# For fokontany
results_fokontany <- pmap_df(
  list(clean_village = correspondence_table2$clean_village, 
       ADM3_PCODE = correspondence_table2$ADM3_PCODE, 
       municipality_string = correspondence_table2$clean_muni),
  function(clean_village, ADM3_PCODE, municipality_string) {
    as.data.frame(t(fuzzy_match_village(clean_village, fokontany, 
                                        "ADM4_clean", ADM3_PCODE, "ADM4_PCODE", municipality_string)))
  }
) %>%
   unnest(cols = everything()) %>%
  rename(ADM4_PCODE = ID, ADM4_clean = matched_string, distance2 = distance) %>%
  st_drop_geometry() %>%
  select(-geometry)
 
# Combine the results with correspondence_table
correspondence_table2_updated <- bind_cols(correspondence_table2, 
                                          results_fokontany) %>%
  mutate(method_village = ifelse(!is.na(ADM4_PCODE), "method_5", NA_character_))


# For a visual validation
correspondence_table2_updated %>% 
  arrange(desc(distance2)) %>%
  write_xlsx("correspondence_table2_update.xlsx")

# After instpecting the data, we find that a 0.2 threshold in distance is 
# appropriate. However some valid matches above this threshold should be kept

high_tolerance_valid_matches <- c(
  "ambanja", "ambararata 2", "ambaro", "ambatoharanana", "ambatomanga", 
  "ambazoamazava", "ambazoamirafy", "ambodibonara", "ambodimotso atsimo", 
  "ambohidrony", "ambohimahatsinjo atsinanana", "ambongabe", "amparafaravola", 
  "ampasy", "ampijoroa", "ampijoroan ala", "ampitana", "analambarika", 
  "andohasoamahainty", "anjiamangirana 1", "ankazoabo anivo", "ankerereake", 
  "ankililoaka 2", "bepako", "bevato", "erada 2", "esanta fototra", 
  "esanta marofoty", "esanta maromainty", "feramanga atsimo", "habohabo atsimo", 
  "habohabo avaratra", "lafiatsinanana", "madiromiongana", "manambaro 2", 
  "manantenina", "mangabe", "mangaoka", "maritampona", "maroala", "marofoty", 
  "marolampy", "marovantaza", "miary", "miary ankasy", "miary ankoronga", 
  "miorimivalana", "tanambao", "tanambao 2", "tandroka andrefana", 
  "tsarahasina", "vohilengo")

# Below this threshold, a few false positives should be removed
low_tolerance_invalid_marches <- c(
  "AMBOHIMAHATSINJO EST", "AMBOHIMAHATSINJO-EST/CENTRE", 
  "AMBALAKINDRESY/ANDOHARENINA", "ambazoa /ampiha", "AMBALAKINDRESY/ANDOHARENA", 
  "ANKAZOABO SUD/ALAMBARIKE", "ANKARINEZAKA/NAMARINA", "AMBALA/ANDRANOLAVA", 
  "AMBALAKINDRESY/ANDRANOLAVA", "AMBALAKINDRESY/ANDRANOLAVA")

correspondence_table3 <- correspondence_table2_updated %>%
  mutate(
    across(c(ADM4_clean, ADM4_PCODE, distance2, method_village), 
           ~ case_when(
             (distance2 > 0.2 & !ADM4_clean %in% high_tolerance_valid_matches) |
             ADM4_clean %in% low_tolerance_invalid_marches ~ NA,
             TRUE ~ .)))

Validating the Quality of Georeferenced Data

To ensure the robustness and validity of our georeferenced data, we employ a multi-tiered validation approach that hinges on quantitative metrics and qualitative consistency checks.

Quantitative metrics

We first look at the distribution of the unique observations that were successfully matched during the various stages of the georeferencing process.

Code
# Re-format modality labels  
matching_tbl <- correspondence_table3 %>%
  mutate(method = case_when(
    method == "method_1" ~ "Method 1",
    method == "method_2" ~ "Method 2",
    method == "method_3" ~ "Method 3",
    method == "method_4" ~ "Method 4",
    is.na(method) ~ "Unmatched communes"),
    method_village = case_when(
      method_village == "method_5" ~ "Matched villages",
      is.na(method_village) ~ "Unmatched villages"))

summarized_data <- matching_tbl %>%
  group_by(method, method_village) %>%
  summarize(total_obs = sum(obs_count, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(percent_obs = total_obs/sum(total_obs))

count_tbl <- summarized_data %>%
  select(method, method_village, total_obs) %>%
  pivot_wider(names_from = method_village, values_from = total_obs, 
              values_fill = 0) %>%
  rowwise() %>%
  mutate(Total = sum(c(`Matched villages`, `Unmatched villages`))) %>%
  ungroup()  %>%
  adorn_totals("row")

percent_tbl <- summarized_data %>%
  select(method, method_village, percent_obs) %>%
  pivot_wider(names_from = method_village, values_from = percent_obs, 
              values_fill = 0) %>%
  rowwise() %>%
  mutate(`Total (%)` = sum(c(`Matched villages`, `Unmatched villages`))) %>%
  ungroup() %>%
  rename(`Matched (%)` = `Matched villages`, 
         `Unmatched (%)` = `Unmatched villages`) %>%
  mutate(across(where(is.numeric), ~ round(.x * 100, 2))) %>%
  adorn_totals("row")


summarized_data <- count_tbl %>%
  left_join(percent_tbl, by = "method") %>%
  relocate(method, `Matched villages`, `Matched (%)`, 
           `Unmatched villages`, `Unmatched (%)`,
           Total, `Total (%)`) %>%
  rename(`Commune matching method` = method)

gt(summarized_data)  %>%
  # Add a title
  tab_header(title = "Result of matching process",
             subtitle = "Number of survey observation per method") 
Table 1: Number of ROR observation which commune and village were matched
Result of matching process
Number of survey observation per method
Commune matching method Matched villages Matched (%) Unmatched villages Unmatched (%) Total Total (%)
Method 1 19341 43.82 4054 9.18 23395 53.00
Method 2 8114 18.38 771 1.75 8885 20.13
Method 3 5839 13.23 1464 3.32 7303 16.55
Method 4 1974 4.47 1359 3.08 3333 7.55
Unmatched communes 0 0.00 1222 2.77 1222 2.77
Total 35268 79.90 8870 20.10 44138 100.00

Consistency check with documentation

To ensure the comprehensiveness of our georeferencing process, we summarize the list of unique matches for villages from all the municipalities in the final dataset. This check will help ensure there are no glaring gaps in the matched data.

Code
fokontany_list <- correspondence_table3 %>%
  mutate(j0 = as.numeric(j0)) %>%
  left_join(observatory_names, by = c("j0" = "code")) %>%
  rename(OBS_NAME = name) %>%
  left_join(select(st_drop_geometry(fokontany), ADM4_PCODE, ADM4_EN),
            by = "ADM4_PCODE") %>%
  group_by(OBS_NAME, ADM3_EN) %>%
  summarize(ADM4_EN = list(unique(if_else(is.na(ADM4_EN), "Unknown", ADM4_EN)))) %>%
  ungroup() %>%
  mutate(ADM4_EN = map_chr(ADM4_EN, ~ str_c(.x, collapse = ", ")))

fokontany_list %>%
  gt() %>%
  # Set column labels
  cols_label(
    OBS_NAME = "Observatory Name",
    ADM3_EN = "Commune Name",
    ADM4_EN = "Fokontany Names") %>%
  # Add a title
  tab_header(title = "Summary of Fokontany List",
             subtitle = "Breakdown by Observatory and Commune") 
Table 2: List of fokontany per commune and observatory
Summary of Fokontany List
Breakdown by Observatory and Commune
Observatory Name Commune Name Fokontany Names
Alaotra Ambatondrazaka Suburbaine Unknown, Ambongabe
Alaotra Amparafaravola Unknown, Ambodimanga, Amparafaravola, Ampasimbola, Analamiranga, Maritampona, Ambondrona
Alaotra Feramanga Nord Unknown, Avaradrano
Alaotra Ilafy Feramanga Atsimo, Unknown, Mangabe, Marianina
Alaotra Morarano Chrome Unknown, Ambatomanga, Ambohidrony, Ambodiatafana, Ambodirano
Alaotra NA Unknown
Ambohimahasoa Ambalakindresy Antsiho, Unknown, Ambohimahatsinjo Est, Ankarinezaka, Andraina, Ambatofolaka
Ambohimahasoa Ampitana Ampitana, Andoharondra, Andohasoamahainty, Unknown, Ambalavola
Ambovombe Ambanisarika Marofoty, Ambanisarika, Ambolimoka, Unknown
Ambovombe Ambazoa Ambazoa II, Ambazoa I, Ampiha, Ambazoamazava, Ampaipaika, Halomboro
Ambovombe Ambonaivo Nagnalo, Beratro I, Beratro II, Unknown
Ambovombe Ambovombe Ambazoamirafy, Ambaro, Amboasary II, Unknown, Esanta Marofoty, Esanta Maromainty, Esanta Fototra, Bevato, Esanta IIi
Ambovombe Erada Erada II, Habohabo Sud, Unknown, Erada IIi, Habohabo Nord, Erada I
Ambovombe NA Unknown
Antsohihy Ambararata Ambararata II
Antsohihy Ambatosia Unknown, Ambatosia
Antsohihy Ambodimotso Sud Ambodimotso Sud, Marolampy Bas
Antsohihy Anjiamangirana Anjiamangirana I, Ambodimanga I, Ambohimarina
Antsohihy Ankerika Ankerika Centre, Ankerika Nord, Ankerika Sud
Antsohihy Marovantaza Marovantaza, Mangaoka
Antsohihy Tsarahasina Tsarahasina, Unknown, Besisika
Antsohihy NA Unknown
Fenerive Est Ambatoharanana Unknown, Ambodihazinina, Ambatoharanana, Ambohimanarivo
Fenerive Est Ampasimbe Manantsatrana Ambodisatrana, Ambanja, Unknown, Ambodibonara
Fenerive Est Ampasina Maningory Unknown, Amboditononona, Tanambao I, Andapabe, Ambatomitrozona
Fenerive Est Miorimivalana Miorimivalana, Unknown
Fenerive Est Vohilengo Vohilengo, Vohipenohely, Unknown
Fenerive Est NA Unknown
Marovoay Ankazomborona Madiromiongana, Unknown
Marovoay Antanimasaka Maroala, Ampijoroa
Marovoay Marovoay Unknown, Ampijoroan'ala
Marovoay Tsararano Unknown, Bepako
Marovoay NA Unknown
Tolanaro Ampasy Nahampoana Unknown, Ampasy Centre
Tolanaro Enakara-Haut Unknown, Enakara Haut
Tolanaro Fort-Dauphin Ambinanibe, Unknown
Tolanaro Mahatalaky Unknown
Tolanaro Manambaro Manambaro I, Unknown, Manambaro II
Tolanaro Manantenina Manantenina Haut, Unknown
Tolanaro Mandromondromotra Mandromondromotra, Hovatraha, Unknown
Tolanaro Ranopiso Ranopiso, Unknown
Tolanaro Soanierana Lafiatsinanana, Unknown, Andranara
Toliara littoral Ankazoabo Sud Analambarika, Unknown, Ankerereake, Atsimon-Dalana, Ankazoabo Anivo, Miary Centre, Tanambao II, Tandroka Ouest
Toliara littoral Ankililoaka Tanambao, Ambondro, Ampagnolora, Unknown, Ankililoaka II
Toliara littoral Miary Ambohibola Miary Ankoronga, Unknown, Miary Ankasy
Toliara littoral NA Unknown

We compare this with the ROR documentation to assess which villages might have been omitted by our algorithm.

Visual exploration and avenues for continuous corrections

We visually explore the the georeferenced data to observe its spatial distribution and spot eventual anomalies or clustering patterns that might not be immediately obvious from numerical summaries. The visualization below maps out our matched municipalities alongside villages. Discrepancies or clusters may indicate a need for further refinement or correction.

Code
commmune_pcode_years <- correspondence_table3 %>%
  # Split and unnest the years
  mutate(years = str_split(years, ", ")) %>%
  unnest(years) %>%
  # Convert years to numeric for proper sorting
  mutate(years = as.numeric(years)) %>%
  # Group by ADM3_PCODE and extract unique, sorted years
  group_by(ADM3_PCODE) %>%
  summarize(years = list(unique(years))) %>%
  ungroup() %>%
  # Sort and collapse the years
  mutate(years = map_chr(years, ~ paste(sort(.x), collapse = ", ")))

fokontany_pcode_years <- correspondence_table3 %>%
  # Split and unnest the years
  mutate(years = str_split(years, ", ")) %>%
  unnest(years) %>%
  # Convert years to numeric for proper sorting
  mutate(years = as.numeric(years)) %>%
  # Group by ADM3_PCODE and extract unique, sorted years
  group_by(ADM4_PCODE) %>%
  summarize(years = list(unique(years))) %>%
  ungroup() %>%
  # Sort and collapse the years
  mutate(years = map_chr(years, ~ paste(sort(.x), collapse = ", ")))

selected_communes <- commmune_pcode_years %>%
  filter(!is.na(ADM3_PCODE)) %>%
  left_join(obs_communes, by = "ADM3_PCODE") %>%
  st_sf()

selected_fokontany <- fokontany_pcode_years %>%
  filter(!is.na(ADM4_PCODE)) %>%
  left_join(fokontany, by = "ADM4_PCODE") %>%
  st_sf()  %>%
  mutate(label = "Surveyed______")

Vahatra <- st_read("../Aires protégées/Vahatra/Shapefiles/AP_Vahatra.shp", 
                   quiet = TRUE) %>%
  st_make_valid() %>%
  mutate(label = "Protected area__")


tmap_mode("view")
tm_shape(selected_communes) + 
  tm_fill(col = "OBS_NAME", palette = "Set1", title = "Observatory",
          id = "ADM3_EN",
          popup.vars = c("Data collection years"  = "years",
                         "District" = "ADM2_EN",
                         "Region" = "ADM1_EN")) +
  tm_shape(selected_fokontany) + 
  tm_fill(col = "label", palette = c("black"), alpha = 0.6, title = "Fokontany", 
          id = "ADM4_EN",
          popup.vars = c("Data collection years"  = "years",
                         "Commune" = "ADM3_EN",
                         "District" = "ADM2_EN",
                         "Region" = "ADM1_EN")) +
  tm_shape(Vahatra) + 
  tm_fill(col = "label", palette = c("darkgreen"), alpha = 0.4,
          popup.vars = c("Year creation" = "an_crtn",
                         "IUCN category" = "cat_icn",
                         "Manager" = "gest_1",
                         "Manager 2" = "gest_2")) 

Figure 3: Map of the surveyed communes and fokonany in protected areas vicinity

Conclusion

Given the variability in location labels recorded in the ROR data, we emphasized standardizing and aligning toponyms. We introduced the clean_string function to unify place names, addressing variances in language, letter casing, and other descriptors. Utilizing fuzzy matching, we connected similar text strings, using the Levenshtein distance as a benchmark. To heighten accuracy, a hierarchical data matching strategy was applied, reflecting the inherent structure of geographical units, where regions envelop districts, which further consist of municipalities and localities. The COD data, curated by BNGRC and OCHA, was our reference point for these matches. Various techniques, including hierarchical fuzzy matching and visual checks, were employed.

As a result, 98% of commune name variations in the dataset matched with an official commune, and 73% of village name variations correlated with a fokontany. The outcomes are consistent with ROR’s internal documentation, and the spatial representations can be further inspected for validation. Future corrections and enhancements are anticipated, and our system, leveraging Quarto notebook and reproducible R code, is poised to facilitate these refinements.

ROR began its data collection at a time when geolocation best practices weren’t as evolved as they are today. Should more survey rounds be conducted in the future, the intricate georeferencing process demonstrated here should serve both as a cautionary tale and a guide on optimizing data registration (for instance, using PCODES). We’re optimistic that refining the georeferencing of past ROR data, as presented in this work, will amplify the data’s utility for both research and policy decisions.